August 08, 2019
R is:
R is an environment where you can read data in, carry out data manipulation, analyse your data, and prepare your results for others to see.
R is an environment where you can:
- read (pull) data in (from a variety of sources),
- carry out (very complex) data manipulation,
- analyse your data (via a broad and rapidly increasing suite of statistical models), and
- prepare (reproducibly) your results (figures, tables, reports, papers) for others to see (or interact with).
“The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.”
R is not a:
Assignment is done with ‘<-’
a <- 3
a will now be 3 (which we’ll see shortly)
‘=’ can be used in place of ‘<-’
use ‘=’ within brackets
3+3
## [1] 6
3*3
## [1] 9
a <- 3
a + a
## [1] 6
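As a minimal sketch of the two assignment forms mentioned above: `<-` assigns at the top level, while `=` inside brackets names a function argument. The names `b`, `vals` and `total` are illustrative only and do not come from the slides.

```r
# Top-level assignment: both forms create the variable
a <- 3
b = 3            # works, but `<-` is the conventional choice
identical(a, b)  # TRUE

# Inside brackets, `=` names an argument rather than creating a variable
vals <- c(1, 2, NA, 4)
total <- sum(vals, na.rm = TRUE)  # na.rm is an argument of sum()
total            # 7
```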
(something) | (x, y)
sum(1, 2, 3, 4, NA, 5, na.rm = T)
## [1] 15
[something] | [ ] [,] [[ ]]
c("one", "two", "three")[2]
## [1] "two"
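The three bracket forms listed above could be sketched as follows. The objects `v`, `m` and `l` are made up for illustration.

```r
v <- c("one", "two", "three")
v[2]          # single element of a vector
v[c(1, 3)]    # several elements
v[-1]         # everything except the first

m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns, filled column-wise
m[2, 3]       # row 2, column 3
m[, 1]        # all rows of column 1

l <- list(name = "R", values = 1:3)
l["values"]   # single bracket: still a list
l[["values"]] # double bracket: the element itself (an integer vector)
```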
{something}
## [1] "matt is awesome"
: used to create a sequence
1:4 produces 1 2 3 4
# For this next bit of code online somewhere
?something for help
T and F are shorthand for the logicals TRUE and FALSE respectively; unlike TRUE and FALSE they are ordinary variables, so don’t overwrite them
getwd() and setwd() get and set the working directory
character is the highest-level ‘catch-all’ class
*data points with non-numeric (alpha) characters can be made into other classes (more to come)
x <- "Teach me R"
x
## [1] "Teach me R"
class(x)
## [1] "character"
x <- Teach me R
## Error: <text>:1:12: unexpected symbol
## 1: x <- Teach me
##                ^
x <- "Teach me R"
y <- "Teach me now"
x + y
## Error in x + y: non-numeric argument to binary operator
x <- "3"
class(x)
## [1] "character"
numeric
double (double precision floating point numbers - computer format)
integer (no decimals)
R may convert between the three as it sees fit
“it is perfectly feasible to use R successfully for years and not need to know the answer to this question” - stackoverflow
x <- "3"
x
## [1] "3"
class(x)
## [1] "character"
x <- 3
x
## [1] 3
class(x)
## [1] "numeric"
x + 2
## [1] 5
x <- as.integer(3.1)
x
## [1] 3
class(x)
## [1] "integer"
x + 0.4
## [1] 3.4
class(x + 0.4)
## [1] "numeric"
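A short sketch of how the numeric types relate, using `typeof()` alongside `class()`. The `3L` integer literal and the variable names are additions for illustration.

```r
x <- 3        # a plain number is stored as a double
y <- 3L       # the L suffix requests an integer
class(x); typeof(x)   # "numeric", "double"
class(y); typeof(y)   # "integer", "integer"

z <- y + 0.4  # mixing integer and double gives a double
typeof(z)     # "double"

as.integer(3.9)  # truncates towards zero (does not round): 3
```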
factor is used when there is a limited set of responses
labels, associated with levels (the responses)
factors can be ordered, but don’t have to be
y <- c("Cat", "Other", "Dog", "Dog"); y
## [1] "Cat" "Other" "Dog" "Dog"
class(y)
## [1] "character"
y <- as.factor(c("Cat", "Other", "Dog", "Dog")); y
## [1] Cat   Other Dog   Dog  
## Levels: Cat Dog Other
class(y); levels(y)
## [1] "factor"
## [1] "Cat" "Dog" "Other"
y <- y[-1] # get rid of the cat
table(y)
## y
##   Cat   Dog Other 
##     0     2     1
y <- as.factor(y); levels(y)
## [1] "Cat" "Dog" "Other"
y <- factor(y); levels(y)
## [1] "Dog" "Other"
?factor
as.factor() does not accept additional arguments
factor() lets you set levels instead of accepting the defaults
relevel lets you set the ‘baseline’ level
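A minimal sketch of controlling level order, assuming the same pet example used earlier. The object names are illustrative.

```r
pets <- c("Cat", "Other", "Dog", "Dog")

# Default: levels are the sorted unique values
f_default <- factor(pets)
levels(f_default)              # "Cat" "Dog" "Other"

# Explicit level order via the levels argument of factor()
f_custom <- factor(pets, levels = c("Dog", "Cat", "Other"))
levels(f_custom)               # "Dog" "Cat" "Other"

# relevel() moves one level to the front (the 'baseline')
f_base <- relevel(f_default, ref = "Other")
levels(f_base)                 # "Other" "Cat" "Dog"
```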
class(tki_demo$intervention)
## [1] "factor"
table(tki_demo$intervention)
## 
## Placebo  Drug 1  Drug 2 
##      29      38      33
levels(tki_demo$intervention)
## [1] "Placebo" "Drug 1" "Drug 2"
levels(factor(as.character(tki_demo$intervention)))
## [1] "Drug 1" "Drug 2" "Placebo"
y <- tki_demo$intervention[1:5]
y # retained class and levels
## [1] Drug 2  Drug 2  Drug 2  Placebo Drug 1 
## Levels: Placebo Drug 1 Drug 2
y <- y[-5] # get rid of the 'Drug 1'
table(y)
## y
## Placebo  Drug 1  Drug 2 
##       1       0       3
y <- as.factor(y); levels(y)
## [1] "Placebo" "Drug 1" "Drug 2"
y <- factor(y); levels(y)
## [1] "Placebo" "Drug 2"
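Besides re-running `factor()`, base R also provides `droplevels()` for discarding unused levels. A small sketch on made-up data (not `tki_demo`):

```r
y <- factor(c("Placebo", "Drug 1", "Drug 2", "Drug 2"),
            levels = c("Placebo", "Drug 1", "Drug 2"))
y <- y[-2]                 # remove the only 'Drug 1' observation
table(y)                   # 'Drug 1' still listed, with a count of 0

y_dropped <- droplevels(y) # base R helper that discards unused levels
levels(y_dropped)          # "Placebo" "Drug 2"
```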
logical variables can only take two values*, TRUE or FALSE
abbreviated as T and F
used in conditional (if) statements
useful functions: sum(), any(), all()
F + F + NA = ?
all(c(T, NA, T)) = ?
y <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
class(y)
## [1] "logical"
table(y)
## y
## FALSE  TRUE 
##     3     2
F + F + NA
## [1] NA
T + NA + T
## [1] NA
F + F + F
## [1] 0
T + T + T
## [1] 3
Oh!
table(y)
## y
## FALSE  TRUE 
##     3     2
sum(y)
## [1] 2
all(c(T, NA, T))
## [1] NA
all(c(F, NA, F))
## [1] FALSE
any(c(T, NA, T))
## [1] TRUE
any(c(F, NA, F))
## [1] NA
all(c(T, F, T))
## [1] FALSE
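The NA results above follow one rule: the answer is NA only when the missing value could change it. A sketch on a made-up vector, also showing the `na.rm` argument:

```r
y <- c(TRUE, FALSE, NA, TRUE)

sum(y)                 # NA: the missing value could be either
sum(y, na.rm = TRUE)   # 2: count of TRUE among the known values
mean(y, na.rm = TRUE)  # proportion TRUE among the known values (2/3)

any(y)                 # TRUE: one TRUE is enough, whatever the NA is
all(y)                 # FALSE: one FALSE is enough, whatever the NA is
```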
date follows relatively strict formatting
POSIXct, POSIXlt
?strptime can be your friend
%d Day of the month as decimal number (01–31).
%e Day of the month as decimal number (1–31), …
%H Hours as decimal number (00–23). …
%y Year without century (00–99). On input, values 00 to 68 are prefixed by 20 and 69 to 99 by 19…
%Y Year with century. Note that whereas there was no zero in the original Gregorian calendar…
Sys.Date()
## [1] "2019-07-26"
class(as.Date("2019-01-31"))
## [1] "Date"
class(as.Date("31-01-2019"))
## [1] "Date"
class(as.Date("31.01.2019"))
## Error in charToDate(x): character string is not in a standard unambiguous format
class(as.Date("31.01.2019", format = "%d.%m.%Y"))
## [1] "Date"
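Note that `as.Date("31-01-2019")` above returns a Date without complaint, which does not mean it was parsed as intended. A hedged sketch of checking this and of stating the format explicitly (variable names are illustrative):

```r
intended <- as.Date("2019-01-31")

# Without a format, "31-01-2019" is tried as year-month-day, so it parses
# without error but NOT as 31 January 2019
guessed <- as.Date("31-01-2019")
guessed == intended            # FALSE

# Stating the format removes the ambiguity
explicit <- as.Date("31-01-2019", format = "%d-%m-%Y")
explicit == intended           # TRUE
```

Checking a parsed value against a known date like this is a cheap safeguard when reading dates from files.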
library(lubridate); class(dmy("31.01.2019"))
## [1] "Date"
duration, interval, period
round(interval(ymd("1983-10-16"), Sys.Date()) / years(1), 2)
## [1] 35.78
as.duration(interval(ymd("1983-10-16"), Sys.Date()))
## [1] "1128988800s (~35.78 years)"
as.period(interval(ymd("1983-10-16"), Sys.Date()))
## [1] "35y 9m 10d 0H 0M 0S"
type / class within R
Single element (scalar typically) - most basic building block
i <- 1
vector is not a class
vector HAS a class, and it can have only one! (remember the hierarchy)
x <- c(3,4,5,6,8) # c stands for combine/concatenate
x
## [1] 3 4 5 6 8
x + 2
## [1] 5 6 7 8 10
length(x)
## [1] 5
sum(x)
## [1] 26
y <- c("Cat", "Dog", "Other", "Dog")
length(y)
## [1] 4
sum(y)
## Error in sum(y): invalid 'type' (character) of argument
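Because a vector can have only one class, mixing types inside `c()` converts everything up the hierarchy. A small sketch (the `mixed` vector is made up):

```r
class(c(TRUE, FALSE))         # "logical"
class(c(TRUE, 2L))            # "integer": TRUE becomes 1L
class(c(TRUE, 2L, 3.5))       # "numeric"
class(c(TRUE, 2L, 3.5, "a"))  # "character": everything becomes text

mixed <- c(1, "two", 3)
mixed                                # "1" "two" "3"
suppressWarnings(as.numeric(mixed))  # 1 NA 3: "two" cannot be converted
```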
data.frame
tibble is functionally very similar to a data.frame
We’ll come back to this
class(tki_demo)
## [1] "data.frame"
dim(tki_demo)
## [1] 100 8
head(tki_demo) # can you guess what tail does?
##   id        dob  male smoker intervention      day1      day2     day3
## 1  1 2004-12-08  TRUE  FALSE       Drug 2  3.787324 19.379647 29.63681
## 2  2 2007-06-14 FALSE   TRUE       Drug 2  1.200292 28.770240       NA
## 3  3 2003-01-05  TRUE  FALSE       Drug 2  6.321257 22.820082 39.02544
## 4  4 2002-09-14 FALSE  FALSE      Placebo -1.302337  4.610366  9.40575
## 5  5 2003-10-24 FALSE  FALSE       Drug 1  7.793055 19.879646 14.73297
## 6  6 2009-03-06  TRUE  FALSE       Drug 1 11.310929 12.648613       NA
names(tki_demo)
## [1] "id"           "dob"          "male"         "smoker"      
## [5] "intervention" "day1"         "day2"         "day3"
str(tki_demo)
## 'data.frame':    100 obs. of  8 variables:
##  $ id          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dob         : Date, format: "2004-12-08" "2007-06-14" ...
##  $ male        : logi  TRUE FALSE TRUE FALSE FALSE TRUE ...
##  $ smoker      : logi  FALSE TRUE FALSE FALSE FALSE FALSE ...
##  $ intervention: Factor w/ 3 levels "Placebo","Drug 1",..: 3 3 3 1 2 2 3 1 1 1 ...
##  $ day1        : num  3.79 1.2 6.32 -1.3 7.79 ...
##  $ day2        : num  19.38 28.77 22.82 4.61 19.88 ...
##  $ day3        : num  29.64 NA 39.03 9.41 14.73 ...
summary(tki_demo[, 1:4])
##        id              dob                male           smoker       
##  Min.   :  1.00   Min.   :1899-02-25   Mode :logical   Mode :logical  
##  1st Qu.: 25.75   1st Qu.:2003-06-06   FALSE:65        FALSE:72       
##  Median : 50.50   Median :2005-07-12   TRUE :35        TRUE :28       
##  Mean   : 50.50   Mean   :2004-05-07                                  
##  3rd Qu.: 75.25   3rd Qu.:2007-04-02                                  
##  Max.   :100.00   Max.   :2009-04-13
head(tki_demo)
##   id        dob  male smoker intervention      day1      day2     day3
## 1  1 2004-12-08  TRUE  FALSE       Drug 2  3.787324 19.379647 29.63681
## 2  2 2007-06-14 FALSE   TRUE       Drug 2  1.200292 28.770240       NA
## 3  3 2003-01-05  TRUE  FALSE       Drug 2  6.321257 22.820082 39.02544
## 4  4 2002-09-14 FALSE  FALSE      Placebo -1.302337  4.610366  9.40575
## 5  5 2003-10-24 FALSE  FALSE       Drug 1  7.793055 19.879646 14.73297
## 6  6 2009-03-06  TRUE  FALSE       Drug 1 11.310929 12.648613       NA
tki_demo[1:2 , 1:3]
##   id        dob  male
## 1  1 2004-12-08  TRUE
## 2  2 2007-06-14 FALSE
tki_demo[1 , 1]
## [1] 1
tki_demo$day1[1:10]
##  [1]  3.787324  1.200292  6.321257 -1.302337  7.793055 11.310929  8.389122
##  [8]  9.046324  6.428155  3.388006
tki_demo$index <- 1:10
tki_demo$index <- 1:nrow(tki_demo)
head(tki_demo)
##   id        dob  male smoker intervention      day1      day2     day3
## 1  1 2004-12-08  TRUE  FALSE       Drug 2  3.787324 19.379647 29.63681
## 2  2 2007-06-14 FALSE   TRUE       Drug 2  1.200292 28.770240       NA
## 3  3 2003-01-05  TRUE  FALSE       Drug 2  6.321257 22.820082 39.02544
## 4  4 2002-09-14 FALSE  FALSE      Placebo -1.302337  4.610366  9.40575
## 5  5 2003-10-24 FALSE  FALSE       Drug 1  7.793055 19.879646 14.73297
## 6  6 2009-03-06  TRUE  FALSE       Drug 1 11.310929 12.648613       NA
##   index
## 1     1
## 2     2
## 3     3
## 4     4
## 5     5
## 6     6
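Since `tki_demo` is not available outside the workshop, the same subsetting ideas can be tried on a small stand-in. The `toy` data.frame below is hypothetical and only loosely mirrors some of the columns shown above.

```r
# Hypothetical toy data, loosely mirroring some columns of tki_demo
toy <- data.frame(
  id   = 1:4,
  male = c(TRUE, FALSE, TRUE, FALSE),
  day1 = c(3.8, 1.2, 6.3, -1.3)
)

toy[1:2, c("id", "day1")]   # rows 1-2, two named columns
toy$day1[toy$male]          # day1 values for the rows where male is TRUE
toy[toy$day1 > 2, ]         # rows meeting a condition (note the trailing comma)

toy$index <- seq_len(nrow(toy))  # new column, one value per row
dim(toy)                         # 4 rows, 4 columns
```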
tibble is the data.frame of the tidyverse
like a data.frame, but less unruly (print)
load("../data/dat.RData"); head(tki_demo, 3)
## # A tibble: 3 x 8
##      id dob        male  smoker intervention  day1  day2  day3
##   <int> <date>     <lgl> <lgl>  <fct>        <dbl> <dbl> <dbl>
## 1     1 2004-12-08 TRUE  FALSE  Drug 2        3.79  19.4  29.6
## 2     2 2007-06-14 FALSE TRUE   Drug 2        1.20  28.8  NA  
## 3     3 2003-01-05 TRUE  FALSE  Drug 2        6.32  22.8  39.0
head(tki_demo, 2)
## # A tibble: 2 x 8
##      id dob        male  smoker intervention  day1  day2  day3
##   <int> <date>     <lgl> <lgl>  <fct>        <dbl> <dbl> <dbl>
## 1     1 2004-12-08 TRUE  FALSE  Drug 2        3.79  19.4  29.6
## 2     2 2007-06-14 FALSE TRUE   Drug 2        1.20  28.8  NA
data.frame of data for each calendar
tki_list <- list(tki_demo,
                 tki_demo_complications)
class(tki_list)
## [1] "list"
head(tki_list[[1]], 2) # access the first element
## # A tibble: 2 x 8
##      id dob        male  smoker intervention  day1  day2  day3
##   <int> <date>     <lgl> <lgl>  <fct>        <dbl> <dbl> <dbl>
## 1     1 2004-12-08 TRUE  FALSE  Drug 2        3.79  19.4  29.6
## 2     2 2007-06-14 FALSE TRUE   Drug 2        1.20  28.8  NA
head(tki_list[[2]], 2) # access the second element
## # A tibble: 2 x 2
##      id complications
##   <int> <chr>        
## 1    12 Yoda speach  
## 2    22 Man flu
lapply(tki_list, head, 2)
## [[1]]
## # A tibble: 2 x 8
##      id dob        male  smoker intervention  day1  day2  day3
##   <int> <date>     <lgl> <lgl>  <fct>        <dbl> <dbl> <dbl>
## 1     1 2004-12-08 TRUE  FALSE  Drug 2        3.79  19.4  29.6
## 2     2 2007-06-14 FALSE TRUE   Drug 2        1.20  28.8  NA  
## 
## [[2]]
## # A tibble: 2 x 2
##      id complications
##   <int> <chr>        
## 1    12 Yoda speach  
## 2    22 Man flu
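Lists can hold objects of different classes and lengths, and their elements can be named. A minimal sketch with a made-up list (`my_list` and `n_elements` are illustrative names):

```r
# Hypothetical list mixing different kinds of objects
my_list <- list(
  numbers = c(3, 4, 5),
  pets    = c("Cat", "Dog"),
  flag    = TRUE
)

my_list$pets            # access by name
my_list[["numbers"]]    # the same idea with double brackets
my_list[[1]]            # or by position

n_elements <- sapply(my_list, length)  # apply length() to each element
n_elements              # numbers = 3, pets = 2, flag = 1
```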
Matrix (2-dimensions)
(compare with data.frames)
all elements share one class (numeric or character etc)
no $ for referencing columns ([ , ] used)
Array (n-dimensions)
beyond the scope of this level - computational efficiency
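For reference, a brief sketch of the matrix and array behaviour described above; all object names are made up.

```r
m <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("a", "b", "c")))
is.matrix(m)    # TRUE
m[, "b"]        # columns are referenced with [ , ], not $
m[1, ]          # first row

# Every element must share one type: adding text converts everything
m_chr <- cbind(m, d = c("x", "y"))
typeof(m_chr)   # "character"

arr <- array(1:24, dim = c(2, 3, 4))  # a 3-dimensional array
dim(arr)        # 2 3 4
```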
text files (.csv, .txt) can be read in with base functions
read.csv() and read.table() are commonly used
(.xls, .xlsx) excel files via read_excel()
header = T
na.strings = c("NA", "missing", "999")
these are read in as NA values, which is very handy
stringsAsFactors = F
use F for character data (ID codes, free text responses)
dat <- read.csv("/DIRECTORY/my_data_file.csv", header = T, na.strings = c("NOT CONTACTED"))
or
dat <- read.csv("../data/my_data_file.csv", header = T, na.strings = c("NOT CONTACTED"))
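To try these arguments without a file on disk, `read.csv()` also accepts inline text through its `text` argument. The CSV content and column names below are hypothetical.

```r
# Hypothetical CSV content, supplied inline instead of from a file
csv_text <- "id,status,score
1,OK,3.5
2,NOT CONTACTED,missing
3,OK,4.1"

dat <- read.csv(text = csv_text,
                header = TRUE,
                na.strings = c("NA", "missing", "NOT CONTACTED"),
                stringsAsFactors = FALSE)

str(dat)             # score is numeric because "missing" was read as NA
is.na(dat$status)    # FALSE TRUE FALSE
```

Without `"missing"` in `na.strings`, the `score` column would contain a non-numeric entry and so would not be read in as a number.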
this creates a data.frame called dat
anything other than 0-9 or . will mean that column is read in as character
read.csv("http://some.where.net/data/foo.csv")
readHTMLTable from the XML package to read tables off websites
install.packages("ggplot2")
library(ggplot2)
update.packages()
‘ask = T’
devtools::install_github()
install_github() function within the devtools package
devtools::install_github("TelethonKids/biometrics", build_vignettes = TRUE)
.libPaths() will show where (on your computer) your packages are stored
citation("ggplot2") will tell you how to cite the package
?geom_path starts (top left) with geom_path {ggplot2}
save your code in scripts (.R, .Rmd) so you can *always* regenerate your results from your original data files
save.image("my_session.Rdata")
load("my_session.Rdata")
saveRDS(dat, "my_data_frame.RDS")
readRDS("my_data_frame.RDS")
write.csv(dat, "my_csv_export.csv")
row.names = F and na = ""
Many of these have been seen during the prior examples:
- <- assign, puts the right hand side into the left hand side
- ? followed by a command, to search for help on a command
- a:b used to generate a series of integers, from a to b
- function( , , …) arguments to a function are separated by commas
- c() concatenate, used to create a vector
- [ ] [,] [[ ]] used to ‘extract’/interact with components of a
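The saving and exporting functions mentioned earlier (saveRDS(), readRDS(), write.csv()) can be sketched as a round trip. The data and the temporary file paths are hypothetical, chosen so that nothing is left behind on disk.

```r
dat <- data.frame(id = 1:3, score = c(3.5, NA, 4.1))

# Hypothetical temporary file locations
rds_file <- tempfile(fileext = ".RDS")
csv_file <- tempfile(fileext = ".csv")

saveRDS(dat, rds_file)            # one R object, classes preserved
dat_back <- readRDS(rds_file)
all.equal(dat, dat_back)          # TRUE

write.csv(dat, csv_file, row.names = FALSE, na = "")
readLines(csv_file)               # header line, then one line per row; NA written as empty
```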